In this work, we present Law2Vec. To train Law2Vec, we collected a large corpus of English legal text from various public sources, comprising the following:
- 53,000 pieces of UK legislation (e.g. UK Public General Acts, Local Acts, etc.) published at http://legislation.gov.uk.
- 62,000 pieces of European legislation (e.g. EU Treaties, Regulations, Directives, etc.) published in Eur-Lex (https://eur-lex.europa.eu/).
- 5,500 pieces of Canadian legislation (e.g. Consolidated Acts, Constitutional Documents, etc.) published at http://laws.justice.gc.ca/eng.
- 1,150 pieces of Australian legislation published at https://www.legislation.gov.au/.
- 800 pieces of legislation from EU countries (e.g. Finland, Sweden, France, Germany) translated into English.
- 780 pieces of Japanese legislation translated into English, published at http://www.japaneselawtranslation.go.jp.
- 68 bound volumes of US Supreme Court decisions from 1998 to 2017, published at https://www.supremecourt.gov/opinions/boundvolumes.aspx.
- 54 titles of the most recently updated U.S. Code, as presented at https://www.law.cornell.edu/uscode/text.
The corpus includes 123,066 documents in total, consisting of 492M individual words (tokens), including punctuation marks and numbers. The corpus was preprocessed to discard non-UTF-8 characters and to rejoin words split apart by layout artefacts (e.g. text extracted from PDF documents). The text was sentence-tokenised using the nltk library to provide the best possible input for the models. All words were lower-cased and all numerical digits were replaced by the character 'D', as in Chalkidis et al. (2017), for normalisation.
We chose to train word2vec models instead of the more recent fasttext implementation. The main reason is that word2vec still seems to provide better semantic representations than fasttext, whose computed n-gram embeddings tend to be highly biased towards syntactic information. Out-of-vocabulary (OOV) words are not a concern in most legal-related tasks, as legislators, lawyers and other legal professionals write to high quality standards. We empirically observed that legal documents are largely free of misspellings and grammatical-syntactical errors, and that their vocabulary is formal and pertinent to the domain.
We trained two individual word2vec models, for 100-dimensional and 200-dimensional embeddings, using the gensim library.